ALL THE METADATA'S A STAGE NOTEBOOK
Exploratory Bibliographic and Text Network Analysis of British Comedy Dramas Across 17th-19th Centuries
Introduction to the Project
This project explores the **British Library’s bibliographic dataset of digitized British dramas from the 17th–19th centuries**, using an **exploratory distant reading** approach. The workflow of this notebook proceeds in three stages:
Descriptive Statistics and Visualization of the Dataset
Provides a structured overview of the dataset for both the researcher and the reader, helping them understand its size, structure, and suitability for further analysis.
Exploratory Data Analysis on the Comedy Subdataset
Examines temporal trends in publications, spatial networks of publishing locations, authorship distribution, and lexical patterns in comedy titles.
Text Mining and Network Analysis
Constructs a syntactic dependency network of frequent lemmas in comedy titles to address the central research question: How do 17th–19th century British comedy titles encode gendered roles and archetypes, and what do they reveal about the social expectations of the period?
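The lexical analysis in the final stage can be previewed with a minimal, self-contained sketch: plain regex tokenization and a `Counter` over a few hypothetical comedy-style titles. This is a simplification of the actual pipeline, which lemmatizes with spaCy and works on the dataset's `title (n)` column; the titles and stopword list below are illustrative only.

```python
import re
from collections import Counter

# hypothetical sample titles, shaped like the dataset's `title (n)` column
titles = [
    "the provoked wife; a comedy",
    "the country wife; a comedy",
    "the busy body; a comedy",
]

# tiny illustrative stopword list (the notebook itself uses NLTK's)
STOP = {"the", "a", "an", "in", "of", "and", "or"}

def tokenize(title):
    # lowercase word tokens only; the real pipeline lemmatizes with spaCy
    return [t for t in re.findall(r"[a-z]+", title.lower()) if t not in STOP]

freq = Counter(tok for title in titles for tok in tokenize(title))
print(freq.most_common(3))  # 'comedy' and 'wife' dominate this toy sample
```

Even this toy count hints at the method: recurring nouns in titles ("wife" here) are exactly the kind of gendered lexical signal the dependency network later examines in context.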
Dataset
- Original Source: British Library digitized drama metadata, cleaned as part of a group assignment for the Introduction to Digital Humanities class, taught by Dr. Margherita Fantoli, in the academic year 2025-26 at KU Leuven (MSc Digital Humanities).
- Data Wrangling Process:
- OpenRefine used to wrangle data while preserving original information.
- Facet filtering to isolate subsets and identify inconsistencies.
- GREL scripting to standardize and transform columns.
- WikiData reconciliation to enrich author information with contextual metadata.
- Dataset Structure: Three subdatasets (Comedy, Tragedy, and Play), cleaned by Chahna Ahuja, Xinran Liu, and Liangyu Gan, respectively.
- Repository: All raw, cleaned datasets, JSON histories, scripts, and data wrangling workflows are archived in a shared GitHub repository for reproducibility and transparency.
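For illustration, the GREL-style standardization described above (e.g. `value.trim().toLowercase()`) has a direct pandas analogue. The author strings below are made-up examples rather than actual dataset rows, and the real cleaning additionally involved Wikidata reconciliation of author names:

```python
import pandas as pd

# toy frame mimicking the raw `author (o)` column (values are illustrative)
raw = pd.DataFrame({
    "author (o)": ["  Inchbald, Mrs., 1753-1821", "SEDAINE, 1719-1797 ", None]
})

# GREL's value.trim().toLowercase() maps onto pandas string methods
raw["author (n)"] = raw["author (o)"].str.strip().str.lower()

# strip trailing life dates, as the cleaned author column does
raw["author (n)"] = raw["author (n)"].str.replace(
    r",?\s*\d{4}-\d{4}$", "", regex=True
)
print(raw["author (n)"].tolist())
```

Missing values (`None`) pass through the `.str` accessor untouched, which mirrors how OpenRefine leaves blank cells blank while transforming the rest.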
Data Analysis Methods in Python & Inspirations
Tools:
- Jupyter Notebook (Python) (data analysis & visualization),
- Gephi (network visualization),
- Plotly, WordCloud (interactive charts),
- spaCy & NLTK (text mining),
- WikiData API (data enrichment).
References Books and Interactive Guides:
- Sarkar, Dipanjan. 2019. Text Analytics with Python: A Practitioner's Guide to Natural Language Processing. New York: Springer. https://doi.org/10.1007/978-1-4842-4354-1
- Karsdorp, F., Kestemont, M., & Riddell, A. 2021. Humanities Data Analysis: Case Studies with Python. Princeton University Press. https://www.humanitiesdataanalysis.org/
- Learning Statistics with Python: Interactive Notebook
Documentation & APIs:
- Pandas for data analysis and manipulation.
- Plotly Documentation for interactive visualizations
- NLTK Documentation for text preprocessing and linguistic analysis
- spaCy Documentation for tokenization, lemmatization, and dependency parsing to build syntactic dependency network
- WikiData Package for Python for enriching author metadata with respective genders.
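As a sketch of the enrichment step, the Wikidata `wbgetentities` API can return an author's P21 ("sex or gender") claim given a QID. The helper names `fetch_author_gender` and `extract_gender_qid` below are hypothetical, error handling is kept minimal, and only the two most common P21 values are mapped:

```python
import requests

WD_API = "https://www.wikidata.org/w/api.php"
# standard Wikidata items for the two most common P21 values
GENDER_LABELS = {"Q6581097": "male", "Q6581072": "female"}

def extract_gender_qid(entity_json, qid):
    """Pull the P21 ('sex or gender') value QID out of a wbgetentities payload."""
    claims = entity_json.get("entities", {}).get(qid, {}).get("claims", {})
    p21 = claims.get("P21", [])
    if not p21:
        return None
    return p21[0]["mainsnak"]["datavalue"]["value"]["id"]

def fetch_author_gender(qid):
    """Query the Wikidata API for one author QID and map P21 to a label."""
    params = {"action": "wbgetentities", "ids": qid,
              "props": "claims", "format": "json"}
    data = requests.get(WD_API, params=params, timeout=10).json()
    return GENDER_LABELS.get(extract_gender_qid(data, qid), "unknown")
```

For example, `fetch_author_gender("Q469974")` (Elizabeth Inchbald's QID in the dataset) would resolve her P21 claim; splitting parsing from fetching keeps the parsing logic testable without network access.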
Notebook Design: The interactive Jupyter Notebook format combines code, output, and narrative for interactive exploration. This is also why Plotly was preferred over static Matplotlib charts.
Reference Inspirations for Notebook Design:
- Karsdorp, F., Kestemont, M., & Riddell, A. 2021. Humanities Data Analysis: Case Studies with Python. Princeton University Press. https://www.humanitiesdataanalysis.org/
- Magister Dixit Notebook
- Learning Statistics with Python: Interactive Notebook
- Tribune Collection Metadata Magic Notebook
GitHub Repository: All raw and cleaned datasets (CSV files), scripts, visual assets, and figures generated in this notebook, as well as the notebook itself, are archived in a GitHub repository for documentation and storage purposes. The original dataset can be accessed here, and the modified dataset is available here!
#import display utilities to render scrollable HTML tables, images, etc.
from IPython.display import display, HTML
from IPython.display import Image, Markdown
#header image
img=Image(url= "https://raw.githubusercontent.com/chahna-ahuja/dh_project/refs/heads/main/visual_assets/header_image.jpg")
display(img)
display(Markdown('''Figure 1: Rowlandson, Thomas. *Drury Lane Theatre.* 1808.
In Augustus Charles Pugin, Thomas Rowlandson and William Combe, *Microcosm of London* (British Library)
https://www.britishlibrary.cn/en/articles/theatre-in-the-19th-century/'''))
Figure 1: Rowlandson, Thomas. Drury Lane Theatre. 1808. In Augustus Charles Pugin, Thomas Rowlandson and William Combe, Microcosm of London (British Library) https://www.britishlibrary.cn/en/articles/theatre-in-the-19th-century/
Table of Contents¶
1. Set up the Notebook and Dataset
2. Explore the Dataset
2.1 Drama Dataset Size, Structure & Suitability (Descriptive Statistics)
2.1.1 Records in the Dataset Across Genres
2.1.2 Dataset Columns and its Structure
2.1.3 Sample Records from Each Genre
2.1.4 Temporal Distribution of Drama Records in the Dataset
2.1.5 Most Common Publishing Locations in the Dataset
2.1.6 Patterns of Missing Data in Dataset Columns
2.1.7 Missing Values Across Genres: Determining the Best Genre Subset for Further Analysis
2.2 Probing Inside the Comedy Drama Subdataset (Exploratory Data Analysis)
2.2.1 Temporal Distribution of Comedy Dramas by Decade Between 17th-19th Centuries
2.2.2 Publishing Locations Network for Comedy Dramas Between 17th to 19th Centuries
2.2.3 Overview of British Comedy Drama Authors in the Dataset
2.2.4 Most Prolific English Comedy Authors by Number of Plays per Century
2.2.5 Authorship Concentration in English Comedy Dramas Across 17th-19th Centuries
2.2.6 Gender Distribution of English Comedy Authors Across 17th-19th Centuries
2.2.7 Differences in Concentration of Comedy Dramas Between Male and Female Authors Across 17th-19th Centuries
3. Exploratory Text Network Analysis
3.1 Setting Up Text Preprocessing Pipeline
3.2 Lexical Patterns in Titles: Text Analysis
3.2.1 Title Length and Token Distribution
3.2.2 Most Frequent Lemmas in English Comedy Drama Titles
3.3 Syntactic Dependency Network of Top 20 Lemmas
1. Set up the Notebook and Dataset¶
- Import Packages, Modules and Libraries used in this project
- Load Dataset CSV from Github
- Divide the CSV into subdatasets by genre
#pandas for data analysis and manipulation tool
import pandas as pd
#to read, write, import csv
import csv
#for os commands
import os
#plotly for interactive data visualization
#it works better for a web dashboard compared to matplotlib
# and it is as customizable as seaborn
import plotly.express as px
# plotly's 'go' to access all the core graph object classes
# (figure, bar, scatter, layout, Axes, etc.) for building and customizing plots
import plotly.graph_objects as go
#to make multiple charts/figure based on plotly too!
from plotly.subplots import make_subplots
#wikidata package to access content from wikidata urls and ids
import wikidata
#to fetch images from wikidata
from wikidata.client import Client
#to fetch qids from wikidata and use them to identify author's gender
import requests
#to make density curve, plotly didn't have an option for this!
from scipy.stats import gaussian_kde
#to make wordclouds, plotly didn't have an option for this!
from wordcloud import WordCloud
#for arrays used in certain data analysis
import numpy as np
#for text analysis/text mining
import re #regular expressions for text normalization
import string #string for punctuation constants
import nltk #for nlp techniques
import spacy #for nlp techniques
from nltk.corpus import stopwords #stopwords in english from nltk
#to count frequencies
from collections import Counter
#to assign automatic keys for network building
from collections import defaultdict
#read whole dataset
df=pd.read_csv(r"https://raw.githubusercontent.com/chahna-ahuja/dh_project/refs/heads/main/data/dh_group7_drama.csv")
#dataset divided by genre
df_comedy = df.loc[df['type'].str.lower() == 'comedy'].copy() #comedy subset
df_tragedy= df.loc[df['type'].str.lower() == 'tragedy'].copy() #tragedy subset
df_play= df.loc[df['type'].str.lower() == 'play'].copy() #play subset
2. Explore the Dataset¶
2.1 Drama Dataset Size, Structure & Suitability (Descriptive Statistics)¶
- How many records are in the dataset and how are they distributed across genres?
- What columns and data types structure the dataset?
- Describe some sample records from each genre (comedy, tragedy, play).
- What is the temporal distribution of Comedy, Tragedy, and Play records in the drama dataset?
- What are the top publishing locations in the dataset?
- What are the patterns of missing data across dataset columns?
- Considering the key bibliographic columns (author, author wikidata, title, publishing year), which genre subset is most suitable for further analysis based on their percentage of missing values?
2.1.1 Records in the Dataset Across Genres¶
How many records are in the dataset and how are they distributed across genres?
total_rows, total_cols = df.shape #shape of total records
comedy_rows, comedy_cols = df_comedy.shape
tragedy_rows, tragedy_cols = df_tragedy.shape
play_rows, play_cols = df_play.shape
# Display cards
display(HTML(f"""
<h3>2.1.1 How many records are in the dataset and how are they distributed across genres?</h3>
<div style='display:flex; gap:30px; margin-bottom:20px;'>
<div style='padding:15px; border-radius:8px; background:#069494; text-align:center; width:180px;'>
<div style='font-size:18px; font-weight:bold;'>Total Dataset</div>
<div style='font-size:28px; font-weight:bold;'>{total_rows}</div>
<div>Records</div>
<div style='margin-top:5px; font-size:18px;'>{total_cols} Columns</div>
</div>
<div style='padding:15px; border-radius:8px; background:#F3A533; text-align:center; width:180px;'>
<div style='font-size:18px; font-weight:bold;'>Comedy</div>
<div style='font-size:28px; font-weight:bold;'>{comedy_rows}</div>
<div>Records</div>
<div style='margin-top:5px; font-size:18px;'>{comedy_cols} Columns</div>
</div>
<div style='padding:15px; border-radius:8px; background:#BC5090; text-align:center; width:180px;'>
<div style='font-size:18px; font-weight:bold;'>Tragedy</div>
<div style='font-size:28px; font-weight:bold;'>{tragedy_rows}</div>
<div>Records</div>
<div style='margin-top:5px; font-size:18px;'>{tragedy_cols} Columns</div>
</div>
<div style='padding:15px; border-radius:8px; background:#ff0f39; text-align:center; width:180px;'>
<div style='font-size:18px; font-weight:bold;'>Play</div>
<div style='font-size:28px; font-weight:bold;'>{play_rows}</div>
<div>Records</div>
<div style='margin-top:5px; font-size:18px;'>{play_cols} Columns</div>
</div>
</div>
"""))
2.1.1 How many records are in the dataset and how are they distributed across genres?
from IPython.display import display, HTML
import pandas as pd
# column definitions
column_definitions = [
"Unique ID",
"Unique ID",
"Original personal author column",
"Cleaned version of author (o)",
"Wikidata URL from author (n)",
"Wikidata identifier from author (n)",
"Original title column",
"Cleaned and standardized title",
"Supplementary title information",
"Standardized edition data",
"Original secondary authors",
"Cleaned secondary authors",
"Standardized series information",
"Standardized subject terms",
"Original imprint column",
"Publishing year extracted from imprint",
"Publishing location extracted from imprint",
"Publisher name extracted from imprint",
"`Comedy', `Tragedy' and `Play' as types"
]
#to interpret pandas dtype
def interpret_dtype(dtype):
if dtype == "int64":
return "Integer values"
elif dtype == "float64":
return "Float values with possible missing data"
elif dtype == "object":
return "Textual or categorical metadata"
else:
return "Other / mixed data type"
# building the column info table
col_info = pd.DataFrame({
"Col.No.": range(1, len(df.columns)+1),
"Column Name": df.columns,
"Data Type": df.dtypes.astype(str),
"Type Description": [interpret_dtype(str(dtype)) for dtype in df.dtypes],
"Column Definition": column_definitions
})
# convert to html for display
col_info_html = col_info.to_html(index=False, escape=False)
# html display tags
display(HTML(f"""
<h3>2.1.2 What columns and data types structure the dataset? </h3>
<p>
This table summarizes the dataset’s columns and their data types, interprets
each type, and provides a brief definition for each bibliographic field.
</p>
<div style="overflow-x:auto; border:1px solid #ccc; padding:6px;">
<style>
table {{
border-collapse: collapse;
width: 100%;
table-layout: auto;
}}
th {{
background-color: #84d1d1 !important;
padding: 3px 5px;
text-align: left !important;
}}
td {{
padding: 2px 5px;
white-space: nowrap;
}}
tr:nth-child(even) {{
background-color: #069494;
}}
</style>
{col_info_html}
</div>
"""))
2.1.2 What columns and data types structure the dataset?
This table summarizes the dataset’s columns and their data types, interprets each type, and provides a brief definition for each bibliographic field.
| Col.No. | Column Name | Data Type | Type Description | Column Definition |
|---|---|---|---|---|
| 1 | aleph system no. | int64 | Integer values | Unique ID |
| 2 | dom id | object | Textual or categorical metadata | Unique ID |
| 3 | author (o) | object | Textual or categorical metadata | Original personal author column |
| 4 | author (n) | object | Textual or categorical metadata | Cleaned version of author (o) |
| 5 | author_wikidata_url | object | Textual or categorical metadata | Wikidata URL from author (n) |
| 6 | author_wikidata_id | object | Textual or categorical metadata | Wikidata identifier from author (n) |
| 7 | title (o) | object | Textual or categorical metadata | Original title column |
| 8 | title (n) | object | Textual or categorical metadata | Cleaned and standardized title |
| 9 | metadata | object | Textual or categorical metadata | Supplementary title information |
| 10 | edition | object | Textual or categorical metadata | Standardized edition data |
| 11 | other authors (o) | object | Textual or categorical metadata | Original secondary authors |
| 12 | other authors (n) | object | Textual or categorical metadata | Cleaned secondary authors |
| 13 | series | object | Textual or categorical metadata | Standardized series information |
| 14 | subject | object | Textual or categorical metadata | Standardized subject terms |
| 15 | imprint (o) | object | Textual or categorical metadata | Original imprint column |
| 16 | p year | float64 | Float values with possible missing data | Publishing year extracted from imprint |
| 17 | p location | object | Textual or categorical metadata | Publishing location extracted from imprint |
| 18 | p name | object | Textual or categorical metadata | Publisher name extracted from imprint |
| 19 | type | object | Textual or categorical metadata | `Comedy', `Tragedy' and `Play' as types |
2.1.3 Sample Records from Each Genre¶
Describe some sample records from each genre (comedy, tragedy, play).
# function to display sample overview
display(HTML("<h3>2.1.3 Sample Records from Each Genre (Comedy, Tragedy, Play)</h3>"))
def sample_overview(df_type, type_name, n=5, header_color="#fce883"):
df_sample = df_type.head(n).copy()
if 'p year' in df_sample.columns:
df_sample['p year'] = df_sample['p year'].astype('Int64')
df_display = df_sample.fillna('')
html_table = df_display.to_html(index=False, escape=False)
html_table = html_table.replace(
'<th>',
f'<th style="background-color:{header_color} !important; text-align:left; padding:6px; white-space:nowrap;">'
)
display(HTML(f"""
<h4>Sample {n} Records: {type_name}</h4>
<div style='overflow-x:auto; width:100%; border:1px solid #ccc; padding:5px'>
<style>
table {{
border-collapse: collapse;
table-layout: auto !important; /* let columns size to content */
width: max-content !important;
min-width: 800px;
}}
th, td {{
text-align: left;
padding: 4px;
min-width: 120px;
white-space: nowrap !important; /* keep each cell on one line */
vertical-align: top !important; /* prevent overlap illusion */
line-height: 1.3; /* stabilize row height */
}}
tr:nth-child(even) {{
background-color: #f2f2f2;
}}
</style>
{html_table}
</div>
"""))
# usage
sample_overview(df_comedy, "Comedy", n=5, header_color="#F3A533")
sample_overview(df_tragedy, "Tragedy", n=5, header_color="#BC5090")
sample_overview(df_play, "Play", n=5, header_color="#ff0f39")
2.1.3 Sample Records from Each Genre (Comedy, Tragedy, Play)
Sample 5 Records: Comedy
| aleph system no. | dom id | author (o) | author (n) | author_wikidata_url | author_wikidata_id | title (o) | title (n) | metadata | edition | other authors (o) | other authors (n) | series | subject | imprint (o) | p year | p location | p name | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14602858 | lsidyv2eb18738 | inchbald, mrs., 1753-1821 | inchbald, elizabeth | https://www.wikidata.org/wiki/Q469974 | Q469974 | next door neighbours; a comedy; in three acts [and in prose]. from the french dramas l'indigent [by l. s. mercier] and le dissipateur [by p. néricault destouches]. [electronic resource] | next door neighbours; a comedy; in three acts | [and in prose]. from the french dramas l'indigent [by l. s. mercier] and le dissipateur [by p. néricault destouches]. | destouches, néricault, 1680-1754 | destouches, néricault | london, 1791. | 1791 | london | comedy | ||||
| 14605599 | lsidyv3031fb89 | the traveller; or, the marriage in sicily [a comedy], in three acts [and in prose. by guglielmo]. [electronic resource] | the traveller; or, the marriage in sicily | [a comedy], in three acts [and in prose. by guglielmo]. | london, 1809. | 1809 | london | comedy | ||||||||||
| 14608193 | lsidyv3097b1af | claris de florian, jean pierre. | claris de florian, jean pierre | https://www.wikidata.org/wiki/Q551740 | Q551740 | look before you leap. a comedy in one act ... translated from ... la bonne mère ... by horatio robson. [electronic resource] | look before you leap. a comedy in one act | translated from...la bonne mère ... by horatio robson. | claris de florian, jean pierre. -- robson, horatio. | claris de florian, jean pierre. robson, horatio | london : harrison & co., 1788. | 1788 | london | harrison & co. | comedy | |||
| 14608240 | lsidyv32b79f0b | mercurius britanicus; or, the english intelligencer. a tragic-comedy, at paris. acted with great applause. [in four acts and in prose. by r. braithwait.] [electronic resource] | mercurius britanicus; or, the english intelligencer. a tragic-comedy | at paris. acted with great applause. [in four acts and in prose. by r. braithwait.] | brathwaite, richard, 1588?-1673. | brathwaite, richard | [london,] , 1641. | 1641 | london | comedy | ||||||||
| 14608260 | lsidyv32b62fd6 | sedaine, 1719-1797 | sedaine | https://www.wikidata.org/wiki/Q367243 | Q367243 | a key to the lock, a comedy in two acts. [translated from “la gageure imprévue.”] [electronic resource] | a key to the lock, a comedy in two acts | [translated from “la gageure imprévue.”] | london, 1788. | 1788 | london | comedy |
Sample 5 Records: Tragedy
| aleph system no. | dom id | author (o) | author (n) | author_wikidata_url | author_wikidata_id | title (o) | title (n) | metadata | edition | other authors (o) | other authors (n) | series | subject | imprint (o) | p year | p location | p name | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14602868 | lsidyv2eb15659 | manipure tragedy. composed in rhyming form. [electronic resource] | manipure tragedy | composed in rhyming form | chittagong : jagot chandra das, 1893. | 1893 | chittagong | jagot chandra das | tragedy | |||||||||
| 14603036 | lsidyv2eb362b1 | ramsdale, a. e. | ramsdale, a. e. | “then,” or, the tragedy of the alsager bazaar. [in verse.] [electronic resource] | “then,” or, the tragedy of the alsager bazaar | [in verse.] | longton : hughes & harber, [1889.]. | 1889 | longton | hughes & harber | tragedy | |||||||
| 14605683 | lsidyv32b541d0 | the tailors; a tragedy for warm weather, in three acts. [a burlesque. adapted by george colman, the elder.] [electronic resource] | the tailors; a tragedy for warm weather, in three acts | [a burlesque. adapted by george colman, the elder.] | colman, george, the elder. | colman, george, the elder | london : t. cadell, 1778. | 1778 | london | t. cadell | tragedy | |||||||
| 14608339 | lsidyv3351b2b4 | fragment of a tragedy lately acted at the british museum. [a burlesque, by s. weston on the loss of certain prints from mr. cracherode's collection.] [electronic resource] | fragment of a tragedy lately acted at the british museum | [a burlesque, by s. weston on the loss of certain prints from mr. cracherode's collection.] | weston, stephen, f.r.s. | weston, stephen | [london, 1806.] | 1806 | london | [ .] | tragedy | |||||||
| 14608380 | lsidyv31afc4db | sotheby, william. | sotheby, william | https://www.wikidata.org/wiki/Q8018620 | Q8018620 | julian and agnes; or, the monks of the great st. bernard. a tragedy, in five acts [and in verse]. [electronic resource] | julian and agnes; or, the monks of the great st. bernard. a tragedy, in five acts | [and in verse] | london, 1801. | 1801 | london | tragedy |
Sample 5 Records: Play
| aleph system no. | dom id | author (o) | author (n) | author_wikidata_url | author_wikidata_id | title (o) | title (n) | metadata | edition | other authors (o) | other authors (n) | series | subject | imprint (o) | p year | p location | p name | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14602867 | lsidyv2eb1acdd | the queen's visit to birmingham! a playful reminiscence ... in random rhyme ... by ... quiz. [electronic resource] | the queen's visit to birmingham! a playful reminiscence | in random rhyme...by ... quiz | victoria, queen of great britain, 1819-1901. | victoria | birmingham : w. cornish, [1858.]. | 1858 | birmingham | w. cornish | play | |||||||
| 14602972 | lsidyv2eb2b217 | moculloch, philim. | moculloch, philim | the murphiad. a mock heroic poem. [an attack on arthur murphy, the dramatist.] [electronic resource] | the murphiad. a mock heroic poem | [an attack on arthur murphy, the dramatist.] | murphy, arthur, 1727-1805. | murphy, arthur | london : printed for j. williams, 1761. | 1761 | london | printed for j. williams | play | |||||
| 14603044 | lsidyv2eb33e33 | woodberry, george edward. | george edward woodberry | https://www.wikidata.org/wiki/Q5538895 | Q5538895 | the players' elegy on the death of edwin booth, etc. [electronic resource] | the players' elegy on the death of edwin booth, etc. | new-york : privately printed, 1893. | 1893 | new york | privately printed | play | ||||||
| 14603329 | lsidyv2eb77ab0 | a quarter of an hour before dinner; or, quality binding. a dramatic entertainment of one act [and in prose. by j. rose]. [electronic resource] | a quarter of an hour before dinner; or, quality binding. a dramatic entertainment of one act | [and in prose. by j. rose] | rose, john, rector of st. martin outwich, london. | rose, john | london, 1788. | 1788 | london | play | ||||||||
| 14603337 | lsidyv2eb7d779 | the spectre poverty: or the realities of life, displayed under an allegory. [in verse. by mrs. furlong.] [electronic resource] | the spectre poverty: or the realities of life, displayed under an allegory | [in verse. by mrs. furlong.] | edinburgh, 1834. | 1834 | edinburgh | play |
2.1.4 Temporal Distribution of Drama Records in the Dataset¶
What is the temporal distribution of Comedy, Tragedy, and Play records in the drama dataset?
# create analysis-specific data copies
df_comedy_dtime = df_comedy.copy()
df_tragedy_dtime = df_tragedy.copy()
df_play_dtime = df_play.copy()
# function to convert year into decade bin
def decade_bin(year):
"""
Converts a year into its decade.
Example: 1673 -> 1670
"""
return (year // 10) * 10
# remove rows where publication year is missing
df_comedy_dtime = df_comedy_dtime.dropna(subset=['p year'])
df_tragedy_dtime = df_tragedy_dtime.dropna(subset=['p year'])
df_play_dtime = df_play_dtime.dropna(subset=['p year'])
# convert publication year from float to integer
df_comedy_dtime['p year'] = df_comedy_dtime['p year'].astype(int)
df_tragedy_dtime['p year'] = df_tragedy_dtime['p year'].astype(int)
df_play_dtime['p year'] = df_play_dtime['p year'].astype(int)
# add decade bins
df_comedy_dtime['decade'] = df_comedy_dtime['p year'].apply(decade_bin)
df_tragedy_dtime['decade'] = df_tragedy_dtime['p year'].apply(decade_bin)
df_play_dtime['decade'] = df_play_dtime['p year'].apply(decade_bin)
# count number of plays per decade for each genre
comedy_counts = df_comedy_dtime['decade'].value_counts().sort_index()
tragedy_counts = df_tragedy_dtime['decade'].value_counts().sort_index()
play_counts = df_play_dtime['decade'].value_counts().sort_index()
# merge counts into one DataFrame
all_bins = sorted(set(comedy_counts.index) |
set(tragedy_counts.index) |
set(play_counts.index))
plot_df = pd.DataFrame({'decade': all_bins})
plot_df['Comedy'] = plot_df['decade'].map(comedy_counts).fillna(0)
plot_df['Tragedy'] = plot_df['decade'].map(tragedy_counts).fillna(0)
plot_df['Play'] = plot_df['decade'].map(play_counts).fillna(0)
# transform to long format using pandas.melt() for plotly
plot_df_melt = plot_df.melt(
id_vars='decade',
value_vars=['Comedy', 'Tragedy', 'Play'],
var_name='Genre',
value_name='Count'
)
#display html
display(HTML("<h3>2.1.4 Temporal Distribution of Drama Records in the Dataset</h3>"))
# create interactive line plot
fig = px.line(
plot_df_melt,
x='decade',
y='Count',
color='Genre',
markers=True,
title="Decade-Wise Distribution of Drama Records Across Genres in the Dataset",
labels={
'decade': 'Decade',
'Count': 'Number of Plays'
},
color_discrete_map={
'Comedy': '#F3A533',
'Tragedy': '#BC5090',
'Play': '#ff0f39'
}
)
# adjust layout
fig.update_layout(
xaxis=dict(
tickmode='linear',
dtick=10,
range=[1540, 1890],
tickangle=45,
tickfont=dict(size=10),
automargin=True,
showline=True,
linewidth=1,
linecolor='black',
title_text="Publishing Year (Decade)"
),
yaxis=dict(
tickformat='d',
dtick=10,
showline=True,
linewidth=1,
linecolor='black',
title_text="Number of English Dramas"
),
template='plotly_white',
width=1100,
height=600
)
# show plot
fig.show()
2.1.4 Temporal Distribution of Drama Records in the Dataset
2.1.5 Most Common Publishing Locations in the Dataset¶
What are the most common publishing locations in the dataset?
#display html
display(HTML("<h3>2.1.5 Most Common Publishing Locations in the Dataset </h3>"))
#copy dataframe as working df for this visualization
df_full = df.copy()
#fill missing locations as 'Null'
df_full['p location'] = df_full['p location'].fillna('null')
#split p locations which contain multiple values like 'london & new york'
#so that each location can be counted separately
df_full['p location'] = df_full['p location'].str.split('&')
# explode into separate rows
#creates a new row for each location in the list
#effectively "duplicating" the original row
# but allowing each location to be counted individually in the top locations
df_full = df_full.explode('p location')
# remove extra spaces
df_full['p location'] = df_full['p location'].str.strip()
# count occurrences to find top10
location_counts = df_full['p location'].value_counts()
# top 10 locations
top10 = location_counts.head(10)
# ensure 'null' is included so missing data stays visible
if 'null' not in top10.index:
top10['null'] = location_counts.get('null', 0)
# convert top 10 to df
treemap_data = pd.DataFrame({
'location': top10.index,
'count': top10.values
})
# custom colors from darkest to lightest
colors = ['#001219', '#9b2226', '#ae2012', '#bb3e03', '#ca6702',
'#ee9b00', '#e9d8a6', '#94d2bd', '#0a9396', '#005f73']
# sort descending by count so top location is most visible
treemap_data_sorted = treemap_data.sort_values(by='count', ascending=False).reset_index(drop=True)
# assign colors
treemap_data_sorted['color'] = ''
color_index = 1 # starting from second color because first is null
for i in range(len(treemap_data_sorted)):
if treemap_data_sorted.loc[i, 'location'] == 'null':
treemap_data_sorted.loc[i, 'color'] = colors[0]
else:
# use remaining colors for the other top locations
treemap_data_sorted.loc[i, 'color'] = colors[color_index % len(colors)]
color_index += 1
#create colormap for each location to show darkest to lightest
color_map = {}
for i in range(len(treemap_data_sorted)):
loc = treemap_data_sorted.loc[i, 'location']
col = treemap_data_sorted.loc[i, 'color']
color_map[loc] = col
# create treemap
fig = px.treemap(
treemap_data_sorted,
path=['location'],
values='count',
color='location',
color_discrete_map=color_map
)
# visualizing on clicking the rectangle
fig.update_traces(
textinfo='label+value',
hovertemplate='<b>%{label}</b><br>Count: %{value}<extra></extra>',
tiling=dict(pad=0.01),
marker=dict(cornerradius=1)
)
# add legend as data is too skewed to london
#so many rectangles won't be readable
for i in range(len(treemap_data_sorted)):
fig.add_trace(go.Scatter(
x=[None],
y=[None],
mode='markers',
marker=dict(size=15, color=treemap_data_sorted.loc[i, 'color']),
name=treemap_data_sorted.loc[i, 'location'],
showlegend=True
))
# adjust layout
fig.update_layout(
title='Top Publishing Locations for English Dramas (Along with Missing Values)',
margin=dict(t=50, l=25, r=25, b=25),
plot_bgcolor='white',
paper_bgcolor='white',
xaxis=dict(visible=False),
yaxis=dict(visible=False),
legend_title_text="Locations"
)
#show treemap
fig.show()
2.1.5 Most Common Publishing Locations in the Dataset
2.1.6 Patterns of Missing Data in Dataset Columns¶
What are the patterns of missing data across dataset columns? Inspired by Visualizing Missing Data in Python.
display(HTML("<h3>2.1.6 Patterns of Missing Data in Dataset Columns</h3>"))
# manipulating dataframe to visualize null vs not null
dfNullCount = pd.DataFrame({
'Column': df.columns,
'Null': df.isnull().sum().values,
'Not Null': df.notnull().sum().values
})
# melt function in pandas to create null/not null as columns
df_melt = dfNullCount.melt(
id_vars='Column',
value_vars=['Not Null', 'Null'],
var_name='Values',
value_name='Count'
)
# create horizontal stacked bar chart
#using plotly px
#
fig = px.bar(
df_melt,
y='Column', # columns on y-axis
x='Count', # counts on x-axis
color='Values',
orientation='h', # horizontal bars
barmode='stack',
text='Count',
color_discrete_map={'Not Null': 'skyblue', 'Null': 'salmon'}
)
# preserve original column order (top to bottom)
fig.update_yaxes(
categoryorder='array',
categoryarray=df.columns[::-1], # reverse so first column is at top
ticklabelstandoff=15
)
# formatting numbers inside bars
fig.update_traces(textposition='inside', textfont_size=10)
# layout adjustments
fig.update_layout(
title={
'text': "Null vs Not Null Values per Dataset Column",
'x': 0.5,
'xanchor': 'center'
},
xaxis_title="Count of Values",
yaxis_title="Columns in the Dataset",
legend_title_text="Values",
height=700,
width=1000,
margin=dict(l=200),
plot_bgcolor='white',
paper_bgcolor='white',
xaxis=dict(dtick=200,showline=True, linewidth=1, linecolor='black', mirror=False),
yaxis=dict(showline=True, linewidth=1, linecolor='black', mirror=False)
)
fig.show() #show chart
2.1.6 Patterns of Missing Data in Dataset Columns
2.1.7 Missing Values Across Genres: Determining the Best Genre Subset for Further Analysis¶
Considering the key bibliographic columns (author, author wikidata, title, publishing year), which genre subset is most suitable for further analysis based on their percentage of missing values?
#header
#with total percent of missing values per genre formula
display(HTML("""
<h3 style="text-align:left;">2.1.7 Missing Values in Key Bibliographical Columns Across Genres</h3>
<p style="text-align:center; font-size:14px;">
$$
\\text{Total Missing \\% per Genre} =
\\frac{1}{n} \\sum_{i=1}^{n}
\\left(
\\frac{\\text{Missing cells in Column}_i}{\\text{Total rows in genre}} \\times 100
\\right)
$$
</p>
"""))
# key columns for my research questions
columns = ['author (n)', 'author_wikidata_url', 'title (n)', 'p year']
# function to calculate missing percentages per column
def missing_percent(df, genre_name):
total = len(df)
percent_missing = []
for col in columns:
missing_count = df[col].isna().sum()
percent = (missing_count / total) * 100
percent_missing.append(percent)
return pd.DataFrame({
'Columns': columns,
'Percent Missing': percent_missing,
'Genre': genre_name
})
# combine missing data for all genres
df_missing = pd.concat([
missing_percent(df_comedy, 'Comedy'),
missing_percent(df_tragedy, 'Tragedy'),
missing_percent(df_play, 'Play')
], ignore_index=True)
#for total missing % per genre
# create an empty DataFrame to store total missing per genre
total_missing = pd.DataFrame()
# group the data by genre
grouped = df_missing.groupby('Genre')
# sum the missing percentages for each genre
sum_missing = grouped['Percent Missing'].sum()
# put the sums into the new DataFrame
total_missing['Genre'] = sum_missing.index # genre names
total_missing['Sum Missing'] = sum_missing.values # sum of missing percentages
# calculate total missing percentage per genre by dividing by number of columns
total_missing['Total Percent Missing'] = total_missing['Sum Missing'] / len(columns)
#ensure order is comedy, tragedy, play
total_missing['Genre'] = pd.Categorical(total_missing['Genre'], categories=['Comedy','Tragedy','Play'], ordered=True)
total_missing = total_missing.sort_values('Genre')
# locate least missing genre based on total missing %
least_missing_genre = total_missing.loc[total_missing['Total Percent Missing'].idxmin(), 'Genre']
least_missing_value = total_missing['Total Percent Missing'].min()
# colors for each genre subdataset
colors = {'Comedy':'#F3A533', 'Tragedy':'#BC5090', 'Play':'#ff0f39'}
# create subplots
fig = make_subplots(
rows=1, cols=2,
subplot_titles=("Missing % per Column by Genre", "Total Missing % by Genre")
)
# left subplot: missing per column
#using go.bar module from plotly
#adding percentage above bar graph
for genre in df_missing['Genre'].unique():
df_g = df_missing[df_missing['Genre'] == genre]
fig.add_trace(
go.Bar(
x=df_g['Columns'],
y=df_g['Percent Missing'],
name=genre,
text=[f"{v:.1f}%" for v in df_g['Percent Missing']],
textposition='outside',
marker_color=colors[genre]
),
row=1, col=1
)
# right subplot: total missing per genre
# using go.bar module from plotly
for genre in total_missing['Genre']:
value = total_missing[total_missing['Genre'] == genre]['Total Percent Missing'].values[0]
fig.add_trace(
go.Bar(
x=[genre],
y=[value],
name=genre,
text=[f"{value:.1f}%"],
textposition='outside',
marker_color=colors[genre],
showlegend=False
),
row=1, col=2
)
# layout adjustments
fig.update_layout(
height=600, width=1100,
plot_bgcolor='white',
paper_bgcolor='white',
title_text="Missing Values Across Genres",
barmode='group',
bargap=0.3,
bargroupgap=0.1
)
fig.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True, tickangle=0, row=1, col=1, title_text="Key Bibliographical Columns")
fig.update_xaxes(showline=True, linewidth=1, linecolor='black', mirror=True, tickangle=0, row=1, col=2, title_text="Genre Subdatasets")
fig.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True, row=1, col=1, title_text="Percentage of Missing Values (%)")
fig.update_yaxes(showline=True, linewidth=1, linecolor='black', mirror=True, row=1, col=2, title_text="Percentage of Missing Values (%)")
# annotate least missing genre on right subplot
fig.add_annotation(
x=least_missing_genre,
y=least_missing_value + 2,
text=f"Least Missing Values",
showarrow=True,
arrowhead=2,
ax=0,
ay=-30,
bgcolor="lightgreen",
bordercolor="green",
row=1, col=2
)
fig.show()
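The per-genre total computed above is simply the mean of the per-column missing fractions, so the whole aggregation collapses into one pandas expression; a minimal sketch on toy data (`df_genre` and its values are hypothetical stand-ins for one genre subdataset):

```python
import pandas as pd

# hypothetical stand-in for one genre subdataset
df_genre = pd.DataFrame({
    'author (n)': ['behn, aphra', None, 'cowley, hannah', None],
    'title (n)': ['The Rover', 'The Busie Body', None, 'The Belle'],
})
columns = ['author (n)', 'title (n)']

# missing fraction per column, averaged across columns, as a percentage
total_pct = df_genre[columns].isna().mean().mean() * 100
print(round(total_pct, 2))  # 37.5
```

This matches the formula in 2.1.7: `isna().mean()` gives each column's missing share, and the second `.mean()` averages over the n key columns.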
2.2 Probing Inside the Comedy Drama Subdataset (Exploratory Data Analysis)
- How are British comedy dramas in the dataset distributed across decades between the 17th and 19th centuries?
- What was the network of publishing locations for British comedy dramas between 17th–19th centuries?
- How many unique authors can be identified in proportion to missing values? What percentage of unique authors have a Wikidata record? What is the distribution of authors across centuries?
- Who were the most prolific English comedy authors, by number of plays, in the 17th, 18th, and 19th centuries?
- What does the frequency distribution of English comedy dramas per author reveal about the concentration of comedy publishing in this period?
- What is the overall gender ratio of English comedy authors in the dataset and how has the gender ratio changed across centuries?
- How does the concentration of English comedy dramas differ between female and male authors?
2.2.1 Temporal Distribution of Comedy Dramas by Decade Between 17th-19th Centuries
- How are British comedy dramas in the dataset distributed across decades between the 17th and 19th centuries?
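The `df_comedy_dtime` frame and its `decade` column are built in section 2.1.4; the derivation can be sketched as follows (the `'p year'` values here are a hypothetical toy stand-in for the real `df_comedy`):

```python
import pandas as pd

# toy stand-in (hypothetical years) for the real df_comedy
df_toy = pd.DataFrame({'p year': ['1662', '1775', 'n.d.', '1843']})

df_comedy_dtime = df_toy.copy()
# coerce non-numeric years to NaN, then drop them
df_comedy_dtime['p year'] = pd.to_numeric(df_comedy_dtime['p year'], errors='coerce')
df_comedy_dtime = df_comedy_dtime.dropna(subset=['p year'])
# floor each year to the start of its decade
df_comedy_dtime['decade'] = (df_comedy_dtime['p year'] // 10 * 10).astype(int)
print(df_comedy_dtime['decade'].tolist())  # [1660, 1770, 1840]
```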
#display html
display(HTML("<h3>2.2.1 Temporal Distribution of Comedy Dramas by Decade Between 17th-19th Centuries</h3>"))
#using dtime from 2.1.4 section
#using plotly to plot histogram
fig = px.histogram(
df_comedy_dtime,
x='decade',
nbins=len(df_comedy_dtime['decade'].unique()),
title='Distribution of English Comedy Dramas by Decade Between 17th-19th Centuries',
labels={
'decade': 'Decade',
'count': 'Number of Comedy Dramas'
},
color_discrete_sequence=['#F3A533'],
text_auto=True
)
#adjust layout
fig.update_layout(
xaxis=dict(
tickmode='linear',
dtick=10,
range=[1530, 1900],
tickangle=45,
tickfont=dict(size=10),
showline=True,
linewidth=1,
linecolor='black',
title_text="Publishing Year (Decade)"
),
yaxis=dict(
tickformat='d',
showline=True,
linewidth=1,
linecolor='black',
title_text="Number of Comedy Dramas"
),
template='plotly_white',
width=1075,
height=600
)
fig.update_traces(
marker_line_color='black',
marker_line_width=1,
textposition='outside'
)
#show histogram
fig.show()
2.2.2 Publishing Locations Network for Comedy Dramas Between 17th to 19th Centuries
What was the network of publishing locations for British comedy dramas between 17th–19th centuries?
# prepare comedy dataset
df_comedy_gmap = df_comedy.copy()
# remove missing locations
df_comedy_gmap = df_comedy_gmap[df_comedy_gmap['p location'].notna()]
# split multiple locations like "london & new york"
df_comedy_gmap['p location'] = df_comedy_gmap['p location'].str.split('&')
# explode so each row has one city
df_comedy_gmap = df_comedy_gmap.explode('p location')
# clean text
df_comedy_gmap['p location'] = df_comedy_gmap['p location'].str.strip().str.lower()
# map cities to countries using pandas.map
#used openrefine to find all the cities
city_to_country = {
'london': 'UK', 'oxford': 'UK', 'cambridge': 'UK', 'edinburgh': 'UK',
'liverpool': 'UK', 'manchester': 'UK', 'bristol': 'UK', 'glasgow': 'UK',
'brighton': 'UK', 'coventry': 'UK', 'warrington': 'UK', 'newcastle upon tyne': 'UK',
'margate': 'UK', 'tetbury': 'UK', 'penzance': 'UK', 'in the savoy': 'UK', 'london in the savoy': 'UK',
'new york': 'USA', 'boston': 'USA', 'cambridge, u.s.a': 'USA',
'dublin': 'Ireland', 'toronto': 'Canada', 'bombay': 'India'
}
df_comedy_gmap['country'] = df_comedy_gmap['p location'].map(city_to_country)
# count entries per city ---
city_counts = (
df_comedy_gmap
.groupby(['p location', 'country'])
.size()
.reset_index(name='count')
.sort_values(['count', 'p location'], ascending=[False, True])
)
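Note that `Series.map` returns NaN for any city missing from `city_to_country`, which is why unmapped rows are dropped later with `dropna`; a minimal sketch (the city names here are illustrative):

```python
import pandas as pd

locations = pd.Series(['london', 'new york', 'atlantis'])  # 'atlantis' is a deliberate unknown
mapping = {'london': 'UK', 'new york': 'USA'}
mapped = locations.map(mapping)
print(mapped.isna().tolist())  # [False, False, True]
```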
#using plotly bubble map to make world map vs uk map
#documentation: https://plotly.com/python/bubble-maps/
display(HTML("<h3>2.2.2 17th–19th Century Comedy Drama Publishing Locations: Global vs UK Network</h3>"))
# city_coords: map each city to (latitude, longitude)
#used GenAI to help with coordinates mapping
city_coords = {
'london': (51.5074, -0.1278), 'oxford': (51.7520, -1.2577), 'cambridge': (52.2053, 0.1218),
'edinburgh': (55.9533, -3.1883), 'liverpool': (53.4084, -2.9916), 'manchester': (53.4808, -2.2426),
'bristol': (51.4545, -2.5879), 'glasgow': (55.8642, -4.2518), 'brighton': (50.8225, -0.1372),
'coventry': (52.4068, -1.5197), 'warrington': (53.3906, -2.5930), 'newcastle upon tyne': (54.9783, -1.6178),
'margate': (51.3813, 1.3862), 'tetbury': (51.5872, -2.1532), 'penzance': (50.1180, -5.5370),
'in the savoy': (51.5128, -0.1225), 'london in the savoy': (51.5128, -0.1225),
'new york': (40.7128, -74.0060), 'boston': (42.3601, -71.0589), 'cambridge, u.s.a': (42.3666, -71.1057),
'dublin': (53.3331, -6.2489), 'toronto': (43.6532, -79.3832), 'bombay': (19.0760, 72.8777)
}
# add country and coordinates columns to the city_counts DataFrame
#documentation for pandas.map() used
#https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html
city_counts['country'] = city_counts['p location'].map(city_to_country)
city_counts['latitude'] = city_counts['p location'].map(lambda x: city_coords.get(x, (None, None))[0])
city_counts['longitude'] = city_counts['p location'].map(lambda x: city_coords.get(x, (None, None))[1])
# remove any rows that do not have coordinates or country info
city_counts_geo = city_counts.dropna(subset=['latitude','longitude','country'])
#customizing colors
country_colors = {
'UK': '#1f77b4', # blue
'USA': '#ff7f0e', # orange
'Ireland': '#2ca02c',# green
'Canada': '#d62728', # red
'India': '#9467bd' # purple
}
#creating side by side comparison subplots
#using subplots in plotly
fig = make_subplots(
rows=1, cols=2,
specs=[[{'type':'scattergeo'}, {'type':'scattergeo'}]], # both are maps
subplot_titles=("World Map", "UK Map") # titles above each map
)
# adding bubble geo maps
for country in city_counts_geo['country'].unique():
df_c = city_counts_geo[city_counts_geo['country']==country]
# scale bubble size based on city count (having min-max to not skew to london visually)
sizes = 5 + (df_c['count'] / df_c['count'].max()) * (15 - 5)
# actual city points
fig.add_trace(go.Scattergeo(
lon=df_c['longitude'],
lat=df_c['latitude'],
text=df_c['p location'] + " (" + df_c['count'].astype(str) + ")",
mode='markers',
marker=dict(
size=sizes,
color=country_colors[country],
line=dict(width=0.5, color='black'),
sizemode='diameter',
opacity=0.7
),
name=country,
showlegend=False
), row=1, col=1)
# add a "dummy" point for the legend (so country shows in legend)
#used GenAI help for this
fig.add_trace(go.Scattergeo(
lon=[None], lat=[None],
mode='markers',
marker=dict(size=10, color=country_colors[country]),
name=country,
showlegend=True
), row=1, col=1)
# add bubble maps
uk_counts = city_counts_geo[city_counts_geo['country']=='UK'].copy()
# use sqrt of Count to make bubble sizes more visually balanced
sqrt_counts = np.sqrt(uk_counts['count'])
scaled_sizes = 5 + (sqrt_counts - sqrt_counts.min()) / (sqrt_counts.max() - sqrt_counts.min()) * (25 - 5)
uk_counts['marker_size'] = scaled_sizes
fig.add_trace(go.Scattergeo(
lon=uk_counts['longitude'],
lat=uk_counts['latitude'],
text=uk_counts['p location'] + " (" + uk_counts['count'].astype(str) + ")",
mode='markers',
marker=dict(
size=uk_counts['marker_size'],
color=country_colors['UK'],
opacity=0.7,
line=dict(width=0.5, color='darkblue')
),
name='UK',
showlegend=False # hide UK from legend here
), row=1, col=2)
# maps layout adjustment
# world map settings
fig.update_geos(
scope='world',
showland=True,
landcolor='rgb(240,240,240)',
countrycolor='gray',
showcoastlines=True,
coastlinecolor='gray',
resolution=110,
row=1, col=1
)
# uk map settings
fig.update_geos(
scope='europe',
showland=True,
landcolor='rgb(240,240,240)',
countrycolor='gray',
showcoastlines=True,
coastlinecolor='gray',
lonaxis_range=[-10,3], # zoom in to UK longitude
lataxis_range=[49.5,59], # zoom in to UK latitude
resolution=50,
row=1, col=2
)
# title
fig.update_layout(
height=600,
width=1100,
title_text="Publishing Locations for Comedy Dramas Between 17th to 19th Century: World vs UK",
legend_title_text="Countries"
)
# show the figure
fig.show()
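The square-root rescaling used for the UK bubble sizes above compresses London's dominance so smaller publishing towns stay visible; on toy counts (hypothetical values):

```python
import numpy as np

counts = np.array([1, 4, 100])  # e.g. a small town, a mid-size city, London
s = np.sqrt(counts)
# min-max scale the square roots into the 5-25 pixel range
sizes = 5 + (s - s.min()) / (s.max() - s.min()) * (25 - 5)
print(np.round(sizes, 1).tolist())  # [5.0, 7.2, 25.0]
```

Without the square root, the count of 100 would push the two smaller markers to nearly identical sizes.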
2.2.3 Overview of British Comedy Drama Authors in the Dataset
- How many unique authors can be identified in proportion to missing values?
- What percentage of unique authors have a Wikidata record?
- What is the distribution of authors across centuries?
display(HTML("<h3>2.2.3 Overview of British Comedy Drama Authors in the Dataset</h3>"))
# make a copy of the dataset
df_author_overview = df_comedy.copy()
# total unique authors vs missing values
author_present = df_author_overview['author (n)'].dropna().nunique()
author_null = df_author_overview['author (n)'].isna().sum()
# unique authors with/without Wikidata
df_unique_authors = df_author_overview[['author (n)', 'author_wikidata_url']].dropna(subset=['author (n)']).drop_duplicates()
wikidata_present_unique = df_unique_authors['author_wikidata_url'].notna().sum()
wikidata_null_unique = df_unique_authors['author_wikidata_url'].isna().sum()
# unique authors across centuries
df_nonnull_authors = df_author_overview[df_author_overview['author (n)'].notna()].copy()
df_nonnull_authors['p year'] = pd.to_numeric(df_nonnull_authors['p year'], errors='coerce')
df_nonnull_authors = df_nonnull_authors.dropna(subset=['p year'])
df_nonnull_authors['century'] = (df_nonnull_authors['p year'] // 100)*100
# count unique authors per century
century_counts = df_nonnull_authors.groupby('century')['author (n)'].nunique().sort_index()
# create subplots
fig = make_subplots(
rows=1, cols=3,
specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]],
subplot_titles=[
"1. Total Unique Authors vs Null",
"2. Authors with/without Wikidata",
"3. Unique Authors Across Centuries"
]
)
#pie chart in plotly
#documentation help taken
#https://plotly.com/python/pie-charts/
# Pie 1
fig.add_trace(go.Pie(
labels=['Total Unique Authors', 'Null Values'],
values=[author_present, author_null],
name="Authors",
marker=dict(colors=['#9A3400', '#380202'])
), 1, 1)
# Pie 2
fig.add_trace(go.Pie(
labels=['With Wikidata', 'Without Wikidata'],
values=[wikidata_present_unique, wikidata_null_unique],
marker=dict(colors=['cyan', 'darkblue']),
name="Wikidata URL"
), 1, 2)
# Pie 3
century_colors = ['#00086F', '#F95959', '#FFC95F', 'palegreen']
century_labels = [] # empty list to store clean century labels
century_counts = century_counts.sort_index()
for c in century_counts.index:
clean_century = int(c) # convert 1700.0 → 1700
century_labels.append(str(clean_century)+'s') # convert to string and store
fig.add_trace(
go.Pie(
labels=century_labels,
values=century_counts.values,
name="Authors by Century",
marker=dict(colors=century_colors[:len(century_counts)]),
sort=False
),
1, 3
)
# adjust layout
fig.update_layout(
title_text="Overview of British Comedy Dramas Authors in the Dataset",
width=1100,
height=500
)
#create figure
fig.show()
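Two century conventions appear in this notebook: hundreds-start labels, `(year // 100) * 100`, used above (1600, 1700, ...), and ordinal century numbers, `year // 100 + 1`, used in 2.2.4 (17, 18, 19). A quick sketch of both, on hypothetical years:

```python
years = [1605, 1699, 1750, 1888]

starts = [(y // 100) * 100 for y in years]  # label by the century's first year
ordinals = [y // 100 + 1 for y in years]    # ordinal century number

print(starts)    # [1600, 1600, 1700, 1800]
print(ordinals)  # [17, 17, 18, 19]
```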
2.2.4 Most Prolific English Comedy Authors with the Most Plays by Century
Who were the most prolific English comedy authors, by number of plays, in the 17th, 18th, and 19th centuries?
# Make a copy of comedy dataset to find authors
df_authors = df_comedy.copy()
# Clean 'p year' and remove invalid values
df_authors['p year'] = pd.to_numeric(df_authors['p year'], errors='coerce')
df_authors = df_authors.dropna(subset=['p year'])
# Add 'century' column in the working authors dataset
df_authors['century'] = (df_authors['p year'] // 100 + 1).astype(int)
# Count plays per author per century including Wikidata URL
author_counts = df_authors.groupby(['century', 'author (n)', 'author_wikidata_url']).size().reset_index(name='num_plays')
# Keep only 17th, 18th, 19th century
author_counts = author_counts[author_counts['century'].isin([17, 18, 19])]
# Sort by century and number of plays descending
author_counts_sorted = author_counts.sort_values(['century', 'num_plays'], ascending=[True, False])
# Take top 10 authors per century
top10_per_century = author_counts_sorted.groupby('century').head(10).reset_index(drop=True)
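`groupby(...).head(10)` keeps the first ten rows of each group in the DataFrame's current row order, which is why the descending sort on `num_plays` has to happen first; a toy sketch with `head(2)` (the data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'century': [17, 17, 17, 18, 18],
    'author (n)': ['a', 'b', 'c', 'd', 'e'],
    'num_plays': [3, 9, 5, 2, 7],
})
# sort first, then take the top rows per group
top2 = (df.sort_values(['century', 'num_plays'], ascending=[True, False])
          .groupby('century').head(2))
print(top2['author (n)'].tolist())  # ['b', 'c', 'e', 'd']
```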
#making the visualization
display(HTML("<h3>2.2.4 Most Prolific English Comedy Authors Based on Number of Plays by Century</h3>"))
# wikidata client to fetch image
client = Client()
#manual image for one author not available on wikidata, but on wikipedia
manual_image = {
"dibdin, thomas": "https://raw.githubusercontent.com/chahna-ahuja/dh_project/refs/heads/main/visual_assets/thomas%20dibdin.jpg"
}
# placeholder image if portraits not found
placeholder_image = "https://raw.githubusercontent.com/chahna-ahuja/dh_project/refs/heads/main/visual_assets/No_Image_Available.jpg"
# fetch image
#inspired from documentation of wikidata package for python
#https://pypi.org/project/Wikidata/
def fetch_author_image(author_name, wikidata_url):
author_key = author_name.lower().strip()
if author_key in manual_image:
return manual_image[author_key]
# Try Wikidata
try:
qid = wikidata_url.split('/')[-1].split('#')[0]
entity = client.get(qid, load=True)
image_prop = client.get('P18') # P18 = image
if image_prop in entity:
return entity[image_prop].image_url
    except Exception:
        pass  # fall through to the placeholder on any lookup failure
    # return placeholder if nothing found
    return placeholder_image
# add image urls to dataframe
image_urls = [] # create empty list
for i in range(len(top10_per_century)):
author_name = top10_per_century.loc[i, 'author (n)']
wikidata_url = top10_per_century.loc[i, 'author_wikidata_url']
img_url = fetch_author_image(author_name, wikidata_url)
image_urls.append(img_url)
top10_per_century['image_url'] = image_urls
# function to create html gallery
#inspiration from this github user
#https://github.com/drduh/python-html-gallery/blob/master/static.py
def create_gallery(df):
html = '<div style="display:flex; flex-wrap:wrap;">'
# Loop through centuries
centuries = sorted(df['century'].unique())
for century in centuries:
html += f'<h4 style="width:100%;">Top 10 Comedy Authors – {century}th Century</h4>'
century_df = df[df['century'] == century]
# Loop through authors in this century
for index, row in century_df.iterrows():
author_name = row['author (n)']
num_plays = row['num_plays']
img_url = row['image_url']
# Add HTML for this author
html += f'''
<div style="margin:12px; text-align:center; width:150px;">
<img src="{img_url}"
alt="{author_name}"
style="
width:120px;
height:150px;
object-fit:contain;
background:#f4f4f4;
border-radius:10px;
border:1px solid #ccc;
">
<p style="margin:6px 0 0 0; font-weight:bold;">{author_name}</p>
<p style="margin:0; font-size:0.85em; color:gray;">Comedy Dramas: {num_plays}</p>
</div>
'''
html += '</div>'
return html
# display gallery
gallery_html = create_gallery(top10_per_century)
display(HTML(gallery_html))
[Output: portrait gallery of the top 10 comedy authors per century, with number of comedy dramas per author]
- 17th century: shirley, james (14); shadwell, thomas (13); behn, aphra (10); d'urfey, thomas (8); ravenscroft, edward (7); crowne, john (6); dryden, john (6); beaumont, francis (5); brome, richard (5); middleton, thomas (5)
- 18th century: cumberland, richard (14); foote, samuel (14); murphy, arthur (8); cowley, hannah (6); andrews, miles peter (4); inchbald, elizabeth (4); d'urfey, thomas (3); holcroft, thomas (3); kotzebue, august friedrich ferdinand von (3); wycherley, william (3)
- 19th century: robertson, thomas william (6); dibdin, thomas (5); arnold, samuel james (4); cumberland, richard (4); morton, thomas (4); colman, george (3); hayes, frederick william (3); meredith, george (3); tobin, john (3); chambers, marianne (2)
2.2.5 Authorship Concentration in English Comedy Dramas Across 17th-19th Centuries
What does the frequency distribution of English comedy dramas per author reveal about the concentration of comedy publishing in this period?
display(HTML("<h3>2.2.5 Authorship Concentration in English Comedy Dramas Across 17th-19th Centuries</h3>"))
# Make a copy of the comedy dataset to work with authorship data
df_authorsdis = df_comedy.copy()
# Convert 'p year' to numbers (invalid values become NaN)
df_authorsdis['p year'] = pd.to_numeric(
df_authorsdis['p year'],
errors='coerce'
)
# Remove rows where publication year is missing
df_authorsdis = df_authorsdis.dropna(subset=['p year'])
# Create a 'century' column from the publication year
df_authorsdis['century'] = (df_authorsdis['p year'] // 100 + 1).astype(int)
# Count how many plays each author published per century
author_counts = (
df_authorsdis
.groupby(['century', 'author (n)'])
.size()
.reset_index(name='num_plays')
)
# Keep only the 17th, 18th, and 19th centuries
author_counts = author_counts[
author_counts['century'].isin([17, 18, 19])
]
# remove null authors
author_counts_clean = author_counts[author_counts['author (n)'].notna()]
# separate data per century and sort descending by number of plays
authors_17 = author_counts_clean[author_counts_clean['century'] == 17].sort_values('num_plays', ascending=False)
authors_18 = author_counts_clean[author_counts_clean['century'] == 18].sort_values('num_plays', ascending=False)
authors_19 = author_counts_clean[author_counts_clean['century'] == 19].sort_values('num_plays', ascending=False)
# count authors per century
count_17 = len(authors_17)
count_18 = len(authors_18)
count_19 = len(authors_19)
# determine max number of authors for X-axis scaling
max_authors = max(count_17, count_18, count_19)
# function to compute density curve
#using gaussian_kde https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html
#inspired by this Stack Overflow discussion
#https://stackoverflow.com/questions/59738337/how-to-draw-a-matching-bell-curve-over-a-histogram
def get_density_curve(y_values, max_x):
kde = gaussian_kde(y_values)
y_min, y_max = min(y_values), max(y_values)
y_range = np.linspace(y_min, y_max, 200)
density = kde(y_range)
# scale density to max y of bars
density = density / max(density) * max(y_values)
# x positions aligned to total max authors for consistent scale
x_curve = np.linspace(0, max_x-1, 200)
return x_curve, density
# create subplots for each century
fig = make_subplots(
rows=1, cols=3,
subplot_titles=("17th Century", "18th Century", "19th Century")
)
# plot bars plus curve
for i, (authors, color, count) in enumerate([
(authors_17, 'blue', count_17),
(authors_18, 'orange', count_18),
(authors_19, 'green', count_19)
], start=1):
x_pos = list(range(len(authors)))
y_values = authors['num_plays']
# histogram
fig.add_trace(go.Bar(
x=x_pos,
y=y_values,
marker_color=color,
#hovering shows author name + count of plays they wrote
hovertext=authors['author (n)'],
hoverinfo="text+y"
), row=1, col=i)
# density curve
x_curve, y_curve = get_density_curve(y_values, max_authors)
fig.add_trace(go.Scatter(
x=x_curve,
y=y_curve,
mode='lines',
line=dict(color='black', width=2),
name='Density Curve',
showlegend=(i==1) # only show legend once
), row=1, col=i)
# update x axis
fig.update_xaxes(showticklabels=False, title_text=f"Authors (Total={count})", row=1, col=i,
range=[0, max_authors-1],showline=True,linewidth=1,linecolor='black')
# update y-axis scale
for col in range(1, 4):
fig.update_yaxes(title_text="Number of English Comedy Dramas", row=1, col=col, dtick=2, range=[0,14],showline=True,linewidth=1,linecolor='black')
# Layout
fig.update_layout(
title_text="Authorship Concentration in English Comedy Dramas Across 17th-19th Centuries",
width=1100,
height=500,
template='plotly_white',
showlegend=False
)
fig.show()
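One sanity check on the scaling inside `get_density_curve`: because the density is divided by its own maximum and multiplied by `max(y_values)`, the returned curve always peaks at the height of the tallest bar. A standalone check with hypothetical `y_values`:

```python
import numpy as np
from scipy.stats import gaussian_kde

y_values = np.array([1, 1, 2, 2, 2, 3, 5, 8])
kde = gaussian_kde(y_values)
y_range = np.linspace(y_values.min(), y_values.max(), 200)
density = kde(y_range)
# same scaling as in get_density_curve
density = density / density.max() * y_values.max()
print(round(float(density.max()), 1))  # 8.0
```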
2.2.6 Gender Distribution of English Comedy Authors Across 17th-19th Centuries
What is the overall gender ratio of English comedy authors in the dataset and how has the gender ratio changed across centuries?
#fetching data from wikidata using qid
import requests
import time
# prepare unique authors with QIDs using author_overview df from 2.2.3
df_unique_authors = df_author_overview[['author (n)', 'author_wikidata_id']].drop_duplicates()
# function to fetch gender from Wikidata
def fetch_gender(qid):
"""
Returns 'Male', 'Female', 'Other', or 'Unknown' for a Wikidata QID.
"""
if pd.isna(qid):
return 'No Data' # differentiate authors with no Wikidata ID
url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
headers = {"User-Agent": "ComedyDHAnalysis/1.0 Python requests"} #to show it is a user agent, not a bot
try:
response = requests.get(url, headers=headers) #request
response.raise_for_status()
data = response.json()
claims = data['entities'][qid]['claims']
if 'P21' in claims: #p 21 is the key for getting gender
gender_id = claims['P21'][0]['mainsnak']['datavalue']['value']['id']
if gender_id == 'Q6581097':
return 'Male'
elif gender_id == 'Q6581072':
return 'Female'
else:
return 'Other'
else:
return 'Unknown' # QID exists but gender not specified
except Exception as e: #if there is an error
print(f"Error fetching {qid}: {e}")
return 'Unknown'
# call function with a short delay before each request (polite rate limiting);
# a single sleep after .apply() would only pause once, after all requests finished
def fetch_gender_throttled(qid):
    time.sleep(0.1)  # pause per request so the queries don't hammer the Wikidata API
    return fetch_gender(qid)
df_unique_authors['gender'] = df_unique_authors['author_wikidata_id'].apply(fetch_gender_throttled)
# count gender as a new df
gender_counts = df_unique_authors['gender'].value_counts()
# df with just male, female authors
df_male_female = df_unique_authors[df_unique_authors['gender'].isin(['Male', 'Female'])].copy()
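The nested-dictionary path that `fetch_gender` walks can be checked offline against a mocked EntityData response: `entities → {qid} → claims → P21 → [0] → mainsnak → datavalue → value → id` (the QID `Q123` here is hypothetical):

```python
# mocked shape of https://www.wikidata.org/wiki/Special:EntityData/<qid>.json
mock = {'entities': {'Q123': {'claims': {'P21': [
    {'mainsnak': {'datavalue': {'value': {'id': 'Q6581072'}}}}
]}}}}
gender_id = mock['entities']['Q123']['claims']['P21'][0]['mainsnak']['datavalue']['value']['id']
# Q6581097 = male, Q6581072 = female (Wikidata item IDs)
label = {'Q6581097': 'Male', 'Q6581072': 'Female'}.get(gender_id, 'Other')
print(label)  # Female
```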
display(HTML("<h3>2.2.6 Gender Distribution of English Comedy Authors Across 17th-19th Centuries</h3>"))
#creating donut chart from the gender fetched using the function
#using gender_counts df
#donut chart using plotly
#essentially pie chart with a hole
#also creating stacked bar chart for gender distribution across centuries
#data for bar chart
df_author_year = (
df_author_overview[['author (n)', 'p year']]
.dropna()
.groupby('author (n)', as_index=False)['p year']
.min()
)
df_male_female_full = pd.merge(
df_male_female,
df_author_year,
on='author (n)',
how='left'
)
df_male_female_full['century'] = (df_male_female_full['p year'] // 100) * 100
stacked_data = (
df_male_female_full
.groupby(['century', 'gender'])['author (n)']
.nunique()
.reset_index(name='count')
.sort_values('century')
)
# subplots
fig = make_subplots(
rows=1, cols=2,
specs=[[{"type": "domain"}, {"type": "xy"}]],
subplot_titles=("Gender Distribution of Unique Comedy Authors",
"Gender Distribution of Comedy Authors by Century")
)
# donut chart
fig.add_trace(
go.Pie(
labels=gender_counts.index,
values=gender_counts.values,
hole=0.5,
marker=dict(colors=['#550000','#000000','#BD5091','#FFA600']),
textinfo='label+percent',
showlegend=False,
hovertemplate='<b>%{label}</b><br>Count: %{value}<extra></extra>'
),
row=1, col=1
)
# annotation for donut
fig.update_layout(
annotations=[dict(text='Gender Ratio', x=0.23, y=0.5, font_size=20, showarrow=False)]
)
#colors for stacked bars
color_map = {'Male': '#550000', 'Female': '#BD5091'}
# stacked bar chart
for gender in ['Male', 'Female']:
gender_df = stacked_data[stacked_data['gender'] == gender]
fig.add_trace(
go.Bar(
x=[f"{int(c)}s" for c in gender_df['century']],
y=gender_df['count'],
name=gender,
marker_color=color_map[gender]
),
row=1, col=2
)
# adjust layout stacked bar
fig.update_yaxes(title_text="Number of Unique Authors", row=1, col=2,showline=True, linewidth=1, linecolor='black', mirror=True)
fig.update_xaxes(title_text="Centuries", row=1, col=2,showline=True, linewidth=1, linecolor='black', mirror=True)
# adjust total layout
fig.update_layout(
height=600,
width=1000,
title_text="Gender Distribution of English Comedy Authors Across 17th-19th Centuries",
barmode="stack",
paper_bgcolor='white',
plot_bgcolor='white'
)
fig.show()
2.2.7 Differences in Concentration of Comedy Dramas Between Male and Female Authors Across 17th-19th Centuries
How does the concentration of English comedy dramas differ between female and male authors?
display(HTML("<h3>2.2.7 Differences in Concentration of Comedy Dramas Between Male and Female Authors Across 17th-19th Centuries</h3>"))
# counting author's plays along with their gender
#first counting plays by authors across centuries same as 2.2.4
df_authors_gender = df_comedy.copy()
df_authors_gender['p year'] = pd.to_numeric(df_authors_gender['p year'], errors='coerce')
df_authors_gender = df_authors_gender.dropna(subset=['p year'])
df_authors_gender['century'] = (df_authors_gender['p year'] // 100) * 100
df_authors_gender = df_authors_gender[df_authors_gender['century'].isin([1500,1600,1700,1800])]
author_play_counts = (
df_authors_gender
.groupby(['century', 'author (n)', 'author_wikidata_id'])
.size()
.reset_index(name='num_plays')
)
#second merging data with gender
author_play_counts = pd.merge(
author_play_counts,
df_unique_authors[['author (n)', 'gender']], #merging data with gender
on='author (n)',
how='left'
)
#refining labels for visualization
author_play_counts_mf = author_play_counts[author_play_counts['gender'].isin(['Male','Female'])].copy()
author_play_counts_mf['century_label'] = author_play_counts_mf['century'].astype(int).astype(str)+'s'
# colors keyed by the century labels built above (string keys so the discrete map applies)
century_color_map = {
    '1500s': '#636EFA',
    '1600s': '#EF553B',
    '1700s': '#00CC96',
    '1800s': '#AB63FA'
}
#split into two dfs for two treemaps
df_female = author_play_counts_mf[author_play_counts_mf['gender']=='Female']
df_male = author_play_counts_mf[author_play_counts_mf['gender']=='Male']
#create subplots
fig = make_subplots(
rows=2, cols=1,
specs=[[{"type":"treemap"}],
[{"type":"treemap"}]],
subplot_titles=("Female Comedy Authors by Century", "Male Comedy Authors by Century"),
vertical_spacing=0.025
)
# female treemap
fig_female = px.treemap(
    df_female,
    path=['author (n)'],
    values='num_plays',
    color='century_label',  # discrete century labels; a numeric column would trigger a continuous scale
    color_discrete_map=century_color_map
)
# add traces for female treemap
for trace in fig_female.data:
fig.add_trace(trace, row=1, col=1)
# male treemap
fig_male = px.treemap(
    df_male,
    path=['author (n)'],
    values='num_plays',
    color='century_label',  # discrete century labels; a numeric column would trigger a continuous scale
    color_discrete_map=century_color_map
)
# add traces for male treemap
for trace in fig_male.data:
fig.add_trace(trace, row=2, col=1)
# adjust layout
fig.update_layout(
height=1500,
width=800,
title_text="Treemaps Showing Concentration of Comedy Dramas by Gender and Century",
paper_bgcolor='white',
plot_bgcolor='white',
)
# hover info
fig.update_traces(
hovertemplate='<b>%{label}</b><br>Plays: %{value}<extra></extra>'
)
#show both treemaps
fig.show()
3. Exploratory Text Network Analysis
- Text Preprocessing Pipeline
- Lexical Patterns in Title
- Syntactic Dependency Network
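The `TitlePreprocessor` below strips bibliographic boilerplate from titles by running `re.sub` cumulatively over a list of patterns; a standalone sketch of that idea on a hypothetical title (only two of the class's patterns shown):

```python
import re

metadata_patterns = [
    r'\bwritten\s+by\b.*',   # drop trailing attribution
    r'\bas\s+it\s+was\b.*',  # drop performance notes
]
title = "The Old Couple, a comedy. As it was acted at the Theatre Royal, written by Thomas May"
# apply each pattern in turn; everything after a match is removed
for pattern in metadata_patterns:
    title = re.sub(pattern, '', title, flags=re.IGNORECASE)
print(title.strip())  # The Old Couple, a comedy.
```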
nlp = spacy.load("en_core_web_sm")
class TitlePreprocessor:
def __init__(self, use_domain_stopwords=False):
# initialize list of stopwords and metadata removal for this dataset
# standard English stopwords
self.stop_words = set(stopwords.words('english'))
# domain specific stopwords (like 'comedy', 'act', etc.)
self.domain_stopwords = set([
'comedy', 'comedie','comedies', 'play', 'poem', 'poetry', 'prose',
'tragicomedy', 'tragi', 'musical', 'farcical',
'act', 'acts', 'edition', 'etc',
'original', 'translated','tragiccomedy','comedietta','pleasant'
])
# can be turned off/on
self.use_domain_stopwords = use_domain_stopwords
# metadata patterns (remove everything after these)
self.number_words_regex = r'(one|two|three|four|five|six|seven|eight|nine)'
self.metadata_patterns = [
r'\bwritten\s+by\b.*',
r'\bas\s+it\s+was\b.*',
r'\bas\s+it\s+is\b.*',
r'\bas\s+acted\b.*',
r'\bas\s+performed\b.*',
r'\btranslated\b.*',
r'[\s,;:.]*\b\d+(st|nd|rd|th|another)?\s+edition\b',
r'[\s,;:.]*\bedition\b',
r'[\s,;:.]*\betc\.?\b',
r'\bcalled\b',
r'\bentituled\b',
r'\blater entitled\b',
r'\b(one|two|three|four|five|six|seven|eight|nine)\s+act[s]?\b',
r'\bin\s+(one|two|three|four|five|six|seven|eight|nine)\s+act[s]?\b'
]
# 1. remove metadata patterns
def remove_metadata(self, title):
if pd.isna(title):
return ''
cleaned_title = title
for pattern in self.metadata_patterns:
cleaned_title = re.sub(pattern, '', cleaned_title, flags=re.IGNORECASE)
return cleaned_title.strip()
# 2. tokenization
def tokenize(self, title, remove_stopwords=True):
if pd.isna(title):
return []
# remove metadata first
title = self.remove_metadata(title)
# lowercase
title = title.lower()
# remove punctuation and numbers
title = re.sub(r'[^a-z\s]', '', title)
# tokenize
tokens = nltk.word_tokenize(title)
filtered_tokens = []
for token in tokens:
if token in string.punctuation:
continue
# skip spelled-out numbers (e.g. 'five'); substring checks against the regex string would wrongly match tokens like 'even'
if re.fullmatch(self.number_words_regex, token):
continue
# Remove standard stopwords if enabled
if remove_stopwords and token in self.stop_words:
continue
# Remove domain stopwords only if enabled
if self.use_domain_stopwords and token in self.domain_stopwords:
continue
filtered_tokens.append(token)
return filtered_tokens
# 3. remove duplicate token lists at corpus level
def remove_duplicate_titles(self, token_lists):
unique_token_lists = []
seen = set()
for tokens in token_lists:
token_tuple = tuple(tokens) # tuples are hashable, so set membership checks are fast
if token_tuple not in seen:
seen.add(token_tuple)
unique_token_lists.append(tokens)
return unique_token_lists
# 4. lemmatization
def lemmatize(self, tokens):
"""
Returns list of lemmas for a token list
"""
doc = nlp(" ".join(tokens))
lemmas = []
for token in doc:
lemmas.append(token.lemma_)
return lemmas
# 5. flatten token lists
def flatten_tokens(self, token_lists):
"""
Flatten list of token lists into a single list
"""
flat_list = []
for tokens in token_lists:
for token in tokens:
flat_list.append(token)
return flat_list
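As a quick, self-contained illustration of the metadata-stripping step (the sample titles below are hypothetical, and only a subset of the patterns is shown), the regexes behave like this:

```python
import re

# subset of the metadata patterns used in TitlePreprocessor
metadata_patterns = [
    r'\bwritten\s+by\b.*',
    r'\bas\s+acted\b.*',
    r'[\s,;:.]*\b\d+(st|nd|rd|th)?\s+edition\b',
]

def strip_metadata(title):
    # apply each pattern in turn, case-insensitively
    for pattern in metadata_patterns:
        title = re.sub(pattern, '', title, flags=re.IGNORECASE)
    return title.strip()

# hypothetical titles illustrating the cleaning
print(strip_metadata("The Country Wife, a comedy, as acted at the Theatre Royal. 2nd edition"))
print(strip_metadata("A play written by J. Dryden"))
```

Because each pattern ending in `.*` consumes the rest of the line, everything after the first metadata marker is dropped in a single substitution.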
3.2 Lexical Patterns in Titles: Text Analysis¶
- How does the token length of English comedy titles inform our ability to extract meaningful lexical patterns and co-occurrence networks?
- What can frequent lemmatized words reveal about the thematic and social preoccupations reflected in English comedy drama titles?
3.2.1 Title Length and Token Distribution¶
How does the token length of English comedy titles inform our ability to extract meaningful lexical patterns and co-occurrence networks?
df_tokendis = df_comedy.copy()
# preprocess titles, remove domain stopwords
preprocessor = TitlePreprocessor(use_domain_stopwords=True)
titles = []
token_lists = []
# keep function words, so do not remove stopwords
for title in df_tokendis['title (n)']:
tokens = preprocessor.tokenize(title, remove_stopwords=False)
#remove duplicates
if tokens not in token_lists:
titles.append(title)
token_lists.append(tokens)
# lemmatize each title
preprocessed_list = [preprocessor.lemmatize(tokens) for tokens in token_lists]
# token counts
token_counts = [len(tokens) for tokens in preprocessed_list]
# build final df
df_tokendis_unique = pd.DataFrame({
'title': titles,
'tokens': preprocessed_list,
'token_count': token_counts
})
display(HTML("<h3>3.2.1 Distribution of Token Counts per English Comedy Title</h3>"))
# boxplot of token counts per title
fig = px.box(
df_tokendis_unique,
x='token_count',
points='all', # show individual points
color_discrete_sequence=['#F3A533'], # consistent color
hover_data={'title': True, 'tokens': True, 'token_count': True}, # show token info
labels={'token_count': 'Number of Tokens per Title'},
title="Distribution of Tokens per Comedy Title (Content + Function Words, Domain Stopwords Removed)"
)
#adjust layout
fig.update_layout(
title_font_size=20,
title_x=0.5, # center the title
xaxis_title_font_size=16,
yaxis_visible=False,
xaxis_visible=True,
yaxis_showticklabels=True,
boxmode='overlay',
plot_bgcolor='white',
margin=dict(l=40, r=40, t=80, b=40),
)
# Update box styling
fig.update_traces(
marker=dict(size=6, opacity=0.6, line=dict(width=0.5, color='DarkOrange')),
line=dict(width=2)
)
fig.show()
3.2.1 Distribution of Token Counts per English Comedy Title
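The summary numbers the boxplot encodes (median and interquartile range of title lengths) can also be read off numerically; a minimal stdlib sketch, using hypothetical token counts in place of `df_tokendis_unique['token_count']`:

```python
import statistics

# hypothetical per-title token counts standing in for the real column
token_counts = [2, 3, 4, 5, 5, 5, 6, 7, 8, 12]

# quantiles(n=4) returns the three quartile cut points [q1, median, q3]
q1, med, q3 = statistics.quantiles(token_counts, n=4)
print(med)      # the box's center line
print(q3 - q1)  # the box's width (interquartile range)
```

A tight interquartile range would indicate that most cleaned titles have a similar token budget, which matters for how comparable the co-occurrence windows are across titles.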
3.2.2 Most Frequent Lemmas in English Comedy Drama Titles¶
What can frequent lemmatized words reveal about the thematic and social preoccupations reflected in English comedy drama titles?
df_text=df_comedy.copy()
# preprocess titles, remove domain stopwords
preprocessor = TitlePreprocessor(use_domain_stopwords=True)
# tokenize titles (removes metadata, usual stopwords)
token_lists = []
for title in df_text['title (n)']:
tokens = preprocessor.tokenize(title, remove_stopwords=True)
token_lists.append(tokens)
# remove duplicates at title level
unique_token_lists = preprocessor.remove_duplicate_titles(token_lists)
# lemmatize each title
preprocessed_list = []
for tokens in unique_token_lists:
lemmas = preprocessor.lemmatize(tokens)
preprocessed_list.append(lemmas)
# flatten all lemmas into one list
flat_tokens = preprocessor.flatten_tokens(preprocessed_list)
# build frequency table
freq_counter = Counter(flat_tokens)
# convert to df for visualization
freq_df = pd.DataFrame(freq_counter.items(), columns=['word', 'frequency'])
# sort by frequency descending
freq_df = freq_df.sort_values(by='frequency', ascending=False).reset_index(drop=True)
# horizontal bar chart
top50 = freq_df.head(50)
bar_colors = [f'rgba(243, 165, 51, {0.3 + 0.7*(f/top50["frequency"].max())})'
for f in top50['frequency']]
bar_trace = go.Bar(
x=top50['frequency'],
y=top50['word'],
orientation='h',
marker_color=bar_colors
)
# wordcloud
top20_freq = dict(freq_df.head(20).values)
wc = WordCloud(width=800, height=400, background_color='white', colormap='viridis').generate_from_frequencies(top20_freq)
wc_img = np.array(wc.to_image())
wc_fig = px.imshow(wc_img)
wc_trace = wc_fig.data[0]
# create subplots
fig = make_subplots(
rows=1, cols=2,
column_widths=[0.6, 0.4],
subplot_titles=("Top 50 Lemmatized Word Frequencies", "Word Cloud of Top 20 Lemmatized Words")
)
fig.add_trace(bar_trace, row=1, col=1)
fig.add_trace(wc_trace, row=1, col=2)
fig.update_xaxes(title_text="Frequency", row=1, col=1, showgrid=True, zeroline=True, matches=None)
fig.update_yaxes(title_text="Words", row=1, col=1, autorange="reversed", matches=None)
fig.update_xaxes(visible=False, row=1, col=2)
fig.update_yaxes(visible=False, row=1, col=2)
# layout adjust
fig.update_layout(
height=900, width=1100,
showlegend=False,
title_text="Most Frequent Lemmas in English Comedy Drama Titles",
plot_bgcolor='white'
)
fig.show()
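The ranked frequency table can also come straight from `Counter.most_common`, which returns (word, count) pairs already sorted by descending count, so the separate `sort_values` step is unnecessary; a self-contained sketch with a toy lemma list in place of `flat_tokens`:

```python
from collections import Counter

# toy lemma list standing in for the flattened corpus
flat_tokens = ['wife', 'love', 'wife', 'lady', 'love', 'wife']

# most_common() yields pairs pre-sorted by frequency, highest first
ranked = Counter(flat_tokens).most_common()
print(ranked)  # → [('wife', 3), ('love', 2), ('lady', 1)]
```

Passing a number, e.g. `most_common(50)`, gives the top-n slice directly, mirroring `freq_df.head(50)`.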
3.3 Syntactic Dependency Network of Top 20 Lemmas¶
Hypothesis: 17th–19th century British comedy titles encode gendered roles and archetypes, offering insight into the social expectations of the period.
Create node and edge CSVs for network analysis in Gephi
#load spacy
nlp = spacy.load("en_core_web_sm")
# working df for network analysis
df_network = df_comedy.copy()
# only remove metadata here:
# dependency parsing needs the original word order,
# and spaCy handles tokenization and POS tagging itself
def remove_metadata(title):
if pd.isna(title):
return ''
metadata_patterns = [
r'\bwritten\s+by\b.*',
r'\bas\s+it\s+was\b.*',
r'\bas\s+it\s+is\b.*',
r'\bas\s+acted\b.*',
r'\bas\s+performed\b.*',
r'\btranslated\b.*',
r'[\s,;:.]*\b\d+(st|nd|rd|th|another)?\s+edition\b',
r'[\s,;:.]*\bedition\b',
r'[\s,;:.]*\betc\.?\b',
r'\bcalled\b',
r'\bentituled\b',
r'\blater entitled\b'
]
cleaned = title
for pattern in metadata_patterns:
cleaned = re.sub(pattern, '', cleaned, flags=re.IGNORECASE)
return cleaned.strip()
# top lemmas identified in the earlier frequency analysis
hub_words = {
'lover','wife','school','love','lady','husband','new','man','old','day',
'marriage','country','widow','false','friend','sir','wit','town','daughter','cuckold','right'
}
# building edges and nodes for network analysis
edges = []
nodes = defaultdict(dict)
for title in df_network['title (n)'].dropna():
cleaned = remove_metadata(title)
doc = nlp(cleaned)
for token in doc:
# adj -> noun (hub)
if token.pos_ == 'ADJ' and token.dep_ == 'amod':
head = token.head.lemma_
if head in hub_words:
edges.append((token.lemma_, head))
nodes[token.lemma_]['NodeType'] = 'ADJ'
nodes[head]['NodeType'] = 'HUB'
# verb -> noun (hub)
if token.pos_ == 'VERB':
for child in token.children:
if child.dep_ in ['nsubj', 'dobj'] and child.lemma_ in hub_words:
edges.append((token.lemma_, child.lemma_))
nodes[token.lemma_]['NodeType'] = 'VERB'
nodes[child.lemma_]['NodeType'] = 'HUB'
# hub noun → prep → NOUN (e.g. school of manners)
if token.lemma_ in hub_words:
for child in token.children:
if child.dep_ == 'prep':
for obj in child.children:
if obj.dep_ == 'pobj':
edges.append((token.lemma_, obj.lemma_))
nodes[token.lemma_]['NodeType'] = 'HUB'
nodes[obj.lemma_]['NodeType'] = 'NOUN'
# edges df
df_edges = pd.DataFrame(edges, columns=['Source', 'Target'])
df_edges['Weight'] = 1
# ensure every node in edges is in nodes for gephi
all_nodes = set(df_edges['Source']).union(df_edges['Target'])
clean_nodes = []
for node in all_nodes:
node_type = nodes[node].get('NodeType', 'NOUN')
clean_nodes.append({'Id': node, 'NodeType': node_type})
df_nodes_clean = pd.DataFrame(clean_nodes)
# save csv
edges_path = "network_edges.csv"
nodes_path = "network_nodes.csv"
if os.path.exists(edges_path):
print(f"{edges_path} already exists, cannot be overwritten.")
else:
df_edges.to_csv(edges_path, index=False)
print(f"{edges_path} saved.")
if os.path.exists(nodes_path):
print(f"{nodes_path} already exists, cannot be overwritten.")
else:
df_nodes_clean.to_csv(nodes_path, index=False)
print(f"{nodes_path} saved.")
# sanity check
print("\nEdges preview:")
print(df_edges.head())
print("\nNodes preview:")
print(df_nodes_clean.head())
network_edges.csv already exists, cannot be overwritten.
network_nodes.csv already exists, cannot be overwritten.
Edges preview:
Source Target Weight
0 day turkey 1
1 day comedy 1
2 day turkey 1
3 school author 1
4 green man 1
Nodes preview:
Id NodeType
0 you NOUN
1 green ADJ
2 ridiculous ADJ
3 town HUB
4 lover NOUN
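As an alternative to deleting duplicate rows by hand in the CSV, repeated (Source, Target) pairs (e.g. from multiple editions sharing a title) can be collapsed into weighted edges with a pandas groupby; a sketch using a toy edge list in the same layout as `df_edges`:

```python
import pandas as pd

# toy edge list mirroring df_edges (one row per occurrence, Weight=1)
df_edges = pd.DataFrame({
    'Source': ['day', 'day', 'day', 'school'],
    'Target': ['turkey', 'comedy', 'turkey', 'author'],
    'Weight': [1, 1, 1, 1],
})

# sum the weights of identical (Source, Target) pairs,
# turning repeats into a single weighted edge
df_weighted = df_edges.groupby(['Source', 'Target'], as_index=False)['Weight'].sum()
print(df_weighted)
```

Gephi reads the `Weight` column natively, so the aggregated CSV can be imported without further manual cleaning.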
# duplicate edges (e.g. from multiple editions sharing a title) were removed manually in the CSV
# nodes and edges were then mapped in Gephi; the resulting network is shown below
img=Image(url= "https://raw.githubusercontent.com/chahna-ahuja/dh_project/refs/heads/main/network_analysis/toplemmas_network.png")
display(img)
display(Markdown('''Figure 2: Syntactic Dependency Network of Most Frequent Lemmas (Orange), Visualized in Gephi'''))
Figure 2: Syntactic Dependency Network of Most Frequent Lemmas (Orange), Visualized in Gephi